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Abstract 

Tukey’s g-and-h distribution has been a powerful tool for data exploration and modeling since 
its introduction. However, two long standing challenges associated with this distribution family 
have remained unsolved until this day: how to find an optimal estimation procedure and how 
to make valid statistical inference on unknown parameters. To overcome these two challenges, 
a computationally efficient estimation procedure based on maximizing an approximated likeli¬ 
hood function of the Tukey’s g-and-h distribution is proposed and is shown to have the same 
estimation efficiency as the maximum likelihood estimator under mild conditions. The asymp¬ 
totic distribution of the proposed estimator is derived and a series of approximated likelihood 
ratio test statistics are developed to conduct hypothesis tests involving two shape parameters of 
Tukey’s g-and-h distribution. Simulation examples and an analysis of air pollution data are used 
to demonstrate the effectiveness of the proposed estimation and testing procedures. 
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1 Introduction 


Datasets with skewed and/or heavy-tailed distributions are typical in many research areas. There 
have been numerous attempts to search for flexible and practically useful distribution families to 
model such data in the statistical community; see Jones (2015) for a comprehensive review. An 
attractive class of distributions introduced by Tukey (1977), and later named Tukey’s g-and-h 
distribution, has been extensively studied by many researchers; see, for example, Martinez & 
Iglewicz (1984), Hoaglin (1985), Morgenthaler & Tukey (2000). Let Z be a random variable 
from a standard normal distribution, N( 0,1). A random variable, Y, is said to have a Tukey’s 
g-and-h distribution if it is obtained through the transformation 

Y = £ + u,T g , h (Z), (1) 

where £ e R is a location parameter, u> > 0 is a scale parameter, and 

Tg,h{z) = h _1 ( ex P(h-) - 1} exp(h^ 2 /2) (2) 

is a one-to-one monotone function of z G M for h > 0, g £ M. When g — 0, we use the customary 
definition of Tq^z) = lim 5 _> 0 r g,h( z ) — z exp(hz 2 /2). To simplify, from now on, values of all 
quantities involving the parameter g evaluated at g — 0 are defined as their limits attained 
at g —> 0. Aside from £ and u, two additional shape parameters, g and h, are introduced to 
accommodate the potential existence of skewness and heavy-tailness in the data distribution. 
More precisely, g > 0 yields a right-skewed distribution while g < 0 corresponds to a left-skewed 
distribution. In the special case of g = h — 0, the resulting distribution reduces to a normal 
distribution with mean £ and variance u> 2 . By setting h — 0, one obtains a shifted log-normal 
distribution, whereas letting g — 0 gives a Pareto-like distribution. In fact, it has been shown that 
many commonly used distributions can be well-approximated by Tukey’s g-and-h distribution 
(Martinez & Iglewicz, 1984; MacGillivray, 1992; Jimenez & Arunachalam, 2011). In Figure 1, we 
present three examples of using Tukey’s g-and-h distributions to approximate other distributions, 
where all parameters were estimated by the proposed estimation procedure using 10, 000 random 
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(a) Standard Cauchy Distribution 


(b) t-distribution with df=2 


(c) Weibull (2,1) Distribution 





y y y 

Figure 1: Illustrations of using Tukey’s g-ancl-/i density function (red dashed line) to approximate 
other density functions (black solid line): (a) Standard Cauchy distribution; (b) Student t- 
distribution with degrees of freedom 2; (c) Weibull distribution with shape parameter 2 and 
scale parameter 1. 

numbers generated from each distribution. As we can see, all three approximations appear to be 
quite good. 

The flexibility of Tukey’s g-and-h family makes it a powerful tool to model real data arisen 
from many research areas. Examples of applications include modelling short interest rate dis¬ 
tributions (Dutta & Babbel, 2002), air pollution data (Rayner & MacGillivray, 2002a, 2002b), 
extreme wind speed (Field, 2004), the value-at-risk of stock prices (Jimenez & Arunachalam, 
2011), operational risk (Degen et al., 2007), and so on. Field & Genton (2006) proposed a multi¬ 
variate version of Tukey’s g-and-h distribution and used it to study data on Australian athletes 
and on wind speed. He & Raghunathan (2006, 2012) proposed to use Tukey’s r/-and-/i distribu¬ 
tion to perform multiple imputations for missing data. Despite its popularity, there remain two 
unsolved challenges associated with Tukey’s g-and-h distribution: the lack of optimal parameter 
estimation procedures and the lack of valid statistical inference tools. Since a small variation in g 
and h may result in significant changes of the shape of the distribution, an estimation procedure 
with high accuracy is crucial for applying Tukey’s g-and-h distribution to real data. Unfortu¬ 
nately, the most accurate maximum likelihood estimator is not available for Tukey’s p-and-h 
distribution because the inverse function of T g h (-) does not have a closed form unless h = 0, 
which makes the likelihood function intractable. Rayner & MacGillivray (2002a) proposed a 
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method to numerically evaluate the log-density function of Tukey’s g-and-h distribution, but 
their approach lacks resistance (Hoaglin, 2010) in that it can be computationally expensive when 
the sample size is large and it may also be numerically unstable for a reason that we will discuss 
later. To bypass this computational challenge, the existing literature has mainly relied on esti¬ 
mation procedures involving matching sample quantiles (Hoaglin, 1985; Dutta & Babbel, 2002) 
or sample moments (Majumder & Ali, 2008) with their population counterparts. Recently, Xu 
et al. (2014) proposed a new estimation procedure named quantile least square approach to 
estimate the parameters. Although all these methods provide satisfactory parameter estimators 
in many applications, our numerical experience shows that they can be significantly less accurate 
than the maximum likelihood estimator. An additional problem with these methods compared 
to the maximum likelihood estimator is that their estimation accuracies depend on a pre-selected 
set of quantiles or moments, which can be subjective in practice. To the best of our knowledge, 
there has been no study on how to choose an “optimal” set of quantiles/moments to sharpen the 
estimation accuracies for these methods. 

A second long-standing challenge with Tukey’s g-and-h family is how to provide valid statis¬ 
tical inference for the parameters. Although Tukey’s g-and-h distributions were first introduced 
as a tool to explore the data, they also have the potential to be used as an inference tool for the 
underlying distribution. For example, one can make statistical inference on whether the underly¬ 
ing distribution is symmetric by testing the hypothesis g = 0. While there have been numerous 
attempts to improve the estimation accuracy, Xu et al. (2014) were the first to derive the asymp¬ 
totic distribution of their estimator. However, like many other quantile-based estimators, this 
asymptotic distribution also depends on a subjectively selected set of quantiles and it remains 
unclear how will this choice affect the validity of the subsequent inference, especially when the 
sample size is small. Furthermore, the limiting distribution of their estimator is only valid when 
the true value of h, say h 0 , satisfies the condition h 0 > 0. In the special case of h 0 = 0, the 
inference becomes irregular and the limiting distribution can be much more complicated than a 
normal distribution. The reason is that the restriction h > 0, which is necessary to ensure the 
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monotonicity of T g j j(-), makes ho = 0 fall on the boundary of the parameter space. Therefore, 
when h 0 = 0, the regularity conditions in Xu et al. (2014) will be violated and therefore the 
result of Xu et ah (2014) cannot be used to test hypotheses such as h = 0. However, because of 
the special interpretations of the shape parameters g and h, testing g = 0 or h = 0 may be of 
particular interest in many applications. 

In this paper, we aim at removing the bottle-neck of sub-optimal estimation procedures and 
the lack of statistical inference tools for Tukey’s g-and-h distribution. By approximating the 
likelihood function using a much simpler tractable function, we are able to obtain a maximum 
approximated likelihood estimator for parameters of Tukey’s g-and-h distributions, which is 
shown to be as efficient as the true maximum likelihood estimator under mild conditions. In 
addition, we derive the limiting distribution of the proposed maximum approximated likelihood 
estimator and develop valid approximated likelihood ratio tests for a series of hypotheses for 
the shape parameters g and h, regardless of the true value ho equals 0 or not. Our simulation 
studies demonstrate that the proposed approach is much more efficient than the quantile-based 
estimators and reaches the same efficiency as that of the maximum likelihood estimator. 

The rest of the paper is organized as follows. In Section 2, an efficient estimation approach 
based on an approximated likelihood function for Tukey’s g-and-h distribution is proposed and 
related computational issues are discussed. The asymptotic and finite sample properties of the 
proposed maximum approximated likelihood estimators are investigated in Section 3. In Sec¬ 
tion 4, simulation studies are conducted to evaluate the performance of the proposed estimation 
procedure and approximated likelihood ratio tests. An application of our methodology to air 
pollution data is presented in Section 5. The article ends with a conclusion in Section 6 and all 
theoretical results are collected in the Appendix. 
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2 Parameter Estimation 


2.1 Existing approaches 

Denote the parameter vector of Tukey’s g-and-h distribution by G = (£, lu, g , h) T . The log-density 
function of the random variable Y from transformation (1) can be written as 

log f Y \e(y) = log 0 | t~1 | - logo; - log r' h jrj } , ( 3 ) 

where 0(-) is the standard normal density function, and and T gh (-) are the inverse function 

and the first derivative function of respectively. Suppose that we have a random sample 

{yi, ..., y n } and let Y = (yi, ..., y n ) T . Then the maximum likelihood estimator G m i e , n is obtained 
by maximizing the log-likelihood function 

n 

L n{0) = ^\ogf Y \ e {yi). (4) 

1=1 

It is well known that under mild regularity conditions, the limiting distribution of 0 m i e , n has 
the smallest variance. One can further utilize tools such as the likelihood ratio test to make 
statistical inference on 6. Unfortunately, since does not have a closed form, numerically 

evaluating L n {6 ) can be computationally expensive, especially when the sample size is large. 
For this reason, the existing literature has largely been focusing on searching for alternative 
estimators, two of such examples are given below. 

For a pre-selected sequence, 0 < pi < p 2 < ■ ■ ■ < Pk < 1, denote by q pil , q PK the 
corresponding sample quantiles of {jq,... ,y n } and let z Pk = for k = 1where 

< F _1 (-) is the inverse of the iV(0,1) cumulative distribution function. The first approach aims 
at directly matching a sequence of sample quantiles and theoretical quantiles, which we re¬ 
fer to as the letter-value-based approach (Dutta & Babbel, 2002). The letter-value-based es¬ 
timator 6 lv>n = (ii v ,uJi v ,gi v ,hi v ) T is defined as: i iv = q 1/2 , gi v = median fe= i ,... }K {g k }, where 
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as = -zMog ( 




log 


qi/2 _ q J and finally (u>i v , hi v ) are obtained from the linear regression 

9iv{q Pk ~ Qi-Pk) 


exp (gi v z Pk ) - exp (~gi v z. 


■pk> 


= log u + hz p j2, k = l,...,K. 


A second approach was proposed by Xn et al. (2014), named the quantile least square 
estimator 9 q i sn , which is defined as the minimizer of the quantile least square loss function 


K 


Lqls{0) — {q Pk ( l P k,(@)} ) 


k =1 


where q Pk {9) — £ + uJT gi h{z Pk ) is the theoretical pfcth quantile of the random variable Y. 

As one can see, both 9i v , n and 9 q i s , n rely on a suitable choice of p^s. While this choice is 
largely subjective in practice, there has not yet been research on how this choice may impact the 
efficiency and the limiting distribution of resulting estimators. 


2.2 A second representation of log fy\d(y) 

The main difficulty for obtaining the maximum likelihood estimator lies in the lack of closed form 
for log fY\o(y)- A natural idea is to find an explicitly computable function that can approximate 
log f Y \e(y) well. To do so, we first define p 0 (y) = F Y \o(y) and z pg ( y) = {pe{y)} , where F Y \e{-) 

is the cumulative distribution function of Y with a parameter vector 9. By this definition, it 
is straightforward to show that z pg ( y ) = Then log j Y \ 0 (y) in (3) can be written as a 

function of z Pg ( y y, that is, 

Veizpoto)} = l °SfY\e(y) = log 4>{z My) } - logo; - log r' gh {z pg{y) } 

= ~ lo S [exp {gz My) } + g^iexpigzp^) - 1 }hz Pg{y) \ - logt j - ^ log(27r). 

The notation z pg ( y ) is used here to emphasize that z pg ( y ) is an unknown quantity depending 
on 9 and y. For simplicity, from now on, we write z pg ( y ) as z p whenever there is no ambiguity. 
Recall that p is the cumulative distribution function of Y evaluated at the observed value y. 
Figures 2(a) and (b) depict an example of log f Y \ 0 (y), or equivalently (p 0 (z p ), as a function y 
and z p , respectively. From Figures 2(a)-(b), we can see that the observed data {jq,..., y n } are 
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sparsely distributed in (— 00 , 00 ). As a function of y, log f Y \e(y) varies rapidly on the left-hand 
side of the sharp spike in the middle. On the contrary, within a bounded interval, pg(z p ) is a 
slowly varying function of z p , which makes it easier to approximate using some numerical method. 
Furthermore, although tpo(z p ) is also defined on (— 00 , 00 ) in theory, with a finite sample size 
n, one only has to focus on a bounded interval [—6 n ,6 n ] for some b n > 0. To see this, consider 
the case when 0 = G 0 with 0 0 = (£ 0 , oj 0 , g 0 , h 0 ) T being the true value of 6 that generated the 
data and choose b n = <h _1 {l — l/(nlogn)}. In this case, the Pi = F Y \g 0 (y ? ;)’s follow a uniform 
distribution on [0,1] and the corresponding z Pi 's follow the N( 0,1) distribution. Straightforward 


(a) 



y 


(b) 



z p 


(c) (d) 




Figure 2: The log-density function of a Tukey’s g-and-h distribution with (£,oj,g,h) = 
(0,1, 0.5, 0.2). (a) log f Y \o(y) as a function of y; (b) Pe{z p ) (solid line) vs p>g(z p ) (dashed line, 
K n = 15, b n = 6) as functions of z P) (c) §ft-(% 0 as a function of z p , (d) pe(z p ) as a function of p. 
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algebra yields that 


P 


( max Zpi 

\ 1 <i<n 



i - <&(M n 


1 - 



i 

n log n 


n 

—>■ 0 as n —> oo, 


where $(•) is the standard normal cumulative distribution function. In this sense, with a sample 
size n = 10, 000, using b n = 4.25 is already a good choice. Therefore, for 6 in a sufficiently 
small neighborhood of do and a finite sample size n, it is reasonable to treat ipe{z p ) as a function 
defined on a bounded support [—b n ,b n \. Our numerical experience shows that it is generally 
sufficient to take b n = 10 in practice. 


2.3 Maximum approximated likelihood estimator 


In this section we develop a numerical algorithm for computing the log-density value, log fy\e(y) = 
(fg(zp), at an observation y for a known parameter vector d, which cannot be computed directly 
because the value of z p is unknown due to the fact that does not have a closed form. 

To overcome this, we propose the following approach to approximate the value of z p . With 
a sample size n and a pre-given b n > 0, we first introduce K n equally spaced knots over the 
interval [—b n ,b n ], denoted as — b n = Zi < Z 2 < • • • < Z k„ = b n , and then compute the corre¬ 
sponding knots in the transformed scale as Y k,e = £ + UT g +(Zk), k = 1,..., K n . For a given 
V ^ [Yi,e,Y K n ,e\i we find the knot Z^ such that Yu,e < y < Yk+i,e- The monotonicity of 
ensures that the z p associated with y must satisfy Z^ < z p < Z^+i. Instead of computing z p by 
numerically solving the equation y = £ + uTg t h(z p ), we use the following linear approximation 


Zp,k ~^-k + 


y - Y fc ,6 


Y 


k+1,0 


- Y 


k,0 


- fc +1 


Z k ) if Y k ,e < y < Y, 


k+1,0- 


( 6 ) 


Because | z p + — z p \ < 2 b n /K n by design, we can expect that if K n is sufficiently large such 
that b n jK n -+ 0, z P) k should approximate z p well. Then, for any observed value y such that 
y = £ + UT g +(z p ), we can define an approximation function for (pg(z p ) as 

I<n~ i 

My) = <pe(z p ) = ^(4fc) J [Y fe , e .Y fc+1 , e ](f/), 

k =1 
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(7) 



where Ia(v) = 1 if y G A and 0 otherwise. In Figure 2(b), we can see that <pe(z p ) is a piecewise 
convex function yielding a good approximation, even though only K n = 15 knots were used in 
this example. To see why this is the case, it is straightforward to show that, for any given 6, 


sup \<Pe(zp) : -<Pe(zp)\<j!r- sup ^z~( z P ) , 

2/£[Y;i,0,Yx n ,e] z p e[-b n ,b„] & z p 

where the notation cd is used here to distinguish it from the usual partial derivative. More 
specifically, by using cd, we treat z p as an argument of the function (pe{z p ) that is independent 
of 6, even though z p is actually a function of 6. For example, 


(kpe_ 

dh p 



-(1 + h)z p — g — 


g Hexpfosp) - 1} + z p 
exp (gz p ) + g~ 1 {exp(gz p ) - 1 }hz p 


Figure 2(c) illustrates that for the same example used in Figure 2(b), §f^-(z p ) is essentially 
bounded within the interval [— b n ,b n ] and therefore sup yg [ Yl g y K e ] \<pg(z p ) — <pg(z p )\ —> 0 as long 
as b n /K n —» 0 when n —> oo. 

Given a random sample {yi,... ,y n }, we can repeat the above process to calculate pi = 
F Y \o{'Ui) and z Pi = <F' 1 (pj) for each data point. Then, the log-likelihood function (4) can be 
approximated by the function 


L n (G) 


y~)i — i '&0 {Vi) i if Yi Q ^ |/min ^ f/max — Y K n ,9i 


—OO, 


otherwise, 


( 8 ) 


where i/jg(y) is defined in (7) and y m in, ?7max are the smallest and largest observed values. The 
maximum approximated likelihood estimator, 0 ma ; e ,n, is defined as the maximizer of (8). 


2.4 Some computational issues 

Rayner & MacGillivray (2002a) proposed a similar “change of support” approach to numerically 
approximate log fY\e{yi) for each y % , i = 1,. .. , n. The idea is essentially to numerically solve 
y = £ + ujTgjJzp) for p and then plug the solution p back into log /y|e(y) = (pe(z p ). However, 
this approach can be computationally expensive when n is large. In addition, as illustrated in 
Figure 2(d), (pe(z p ) as a function of p is rather steep on both ends of the interval [0,1], which is 
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very different from Figure 2(b). Consequently, for p close to 0 or 1, a small error in p can lead to 
non-negligible errors when evaluating (po(z p ) using tp e (zp). In this sense, Rayner & MacGillivray 
(2002a)’s method can also be numerically unstable. 

The computational complexity of L n {0 ) is equivalent to assigning n data points to K n — 1 bins 
in a histogram, which can be efficiently implemented using some bucket sort algorithm (Corwin 
& Logar, 2004). The average computational complexity for such an algorithm can be easily 
achieved as 0(n + K n ), which is fast even for very large n and K n . In this paper, we use the 
function .bincode from the R software (R Development Core Team, 2014) to implement this 
algorithm. Based on Theorems 1-2 below, to guarantee estimation efficiency, we need to ensure 
that K n is large enough so that Condition 5 in the Appendix is met. Our numerical experience 
indicates that using K n = max(1000, n) is sufficient for most applications. With such a choice of 
K n , our simulation studies in Section 4 demonstrate that the proposed estimation method can 
be several hundred times faster than the numerical maximum likelihood estimation approach 
proposed in Rayner & MacGillivray (2002a). 

Another important issue in maximizing L{0) defined in (8) is the choice of initial values for 0. 
In the Supplementary Material, we give the Gradient and the Hessian matrix of L(0), which can 
be utilized for any standard optimization routine. We use the “L-BFGS-B 1 ' method (Byrd et al. 
1995) provided in the R function optim to maximize L{0) since it is a constrained optimization 
problem due to the restriction h > 0. To find 0 ma ie,m the initial value 0° must be chosen such 
that Condition 1 in the Appendix is met, that is, Y 1£ >o < y min < y max < Y Kn0 o. Based on our 
discussion in Subsection 2.1, we need essentially to find a 0° that is reasonably close to the true 
value 0o- Fortunately, such a requirement can be easily fulfilled by taking 0° as the letter-value- 
based estimator 0|, vn or the quantile least square estimator 0 q i Stn . In this paper, we choose 0i VjU 
as the initial value of our algorithm. 

Finally, we would like to point out that, unlike the quantile-based estimators 6i V)n and 0 q i s , n , 
whose performances depend on a subjective choice of quantiles, the estimation accuracy of the 
maximum approximated likelihood estimator 0 ma ie,n does not depend on the positions of the 
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chosen knots as long as the number of knots K n is sufficiently large. This is reflected in the fact 
that the results of Theorems 1-2 do not depend on K n at all. In addition, we have proposed 
to use equally spaced knots merely for the simplicity of our technical investigations. Generally 
speaking, any knot sequence — b n = Zi < Z 2 < • • • < Z Kn = b n satisfying the condition that 
inf i<i<K n -i |Zj + i — Zj| —* 0 sufficiently fast as n —> oo will do the job. 

3 Statistical Inference 

It is often desirable to provide an uncertainty measure for a point estimator in statistical re¬ 
search, which remains a challenge when fitting Tukey’s g-and-h distribution to data. For the 
popular letter-value-based approach, the asymptotic distribution of 0i ViH remains unknown. Al¬ 
though the quantile least square approach proposed by Xu et al. (2014) results in an asymptotic 
normal distribution for 0 q i s>n , there remain two problems to be addressed. First, the asymptotic 
distribution of 0 q i s , n depends on a preselected set of pk s and it remains unknown how this choice 
affects the subsequent statistical inference. Furthermore, their theory relies on the implied as¬ 
sumption that h > 0 and does not hold if the true value of h is 0. However, for Tukey’s g-and-h 
distribution, h = 0 is a special case of particular interest. On the one hand, by testing h — 0, 
one can tell whether or not the data have heavy tails. On the other hand, if we have significant 
evidence to believe that the true value of h is 0, then letting h — 0 would make the function 
Tg,h(-) invertible and thus makes the maximum likelihood estimator achievable. 

3.1 Asymptotic properties of Q ma i e , n 

Denote by ff = Mx (0, oo) xlx[0, oo) the parameter space of 6 = (£, oj, g, h) T and define U n (0) = 
dL n (0)/d0 , U n {0) = dL n (0)/d0 and J n (0) = —d 2 L n (d)/ddd6 T . Let J(0 O ) be the expected value 
of I n (6) when 6 = 0 o and define a random vector Z = (Z i, Z 2 , Z%, Z^) T ~ 1V4(0, / 1 (0o)), where 
^ 4 ( 0 , J _ 1 (0o)) is a 4-dimensional multivariate normal distribution with mean 0 and covariance 
matrix J _1 (0o)- The following theorem gives the asymptotic distribution of 0 ma i e , n under the 
scenario ho = 0 or ho > 0. The proof is given in the Appendix. 
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Theorem 1. (Asymptotic Distribution) Under Conditions 1-5 in the Appendix, as n —> oo and 
K n —> 00, we have: (a) If the true value h 0 > 0, then y/n{9 ma i e ^ n — 9 0 ) —> Z in distribution; (b) 
If the true value h 0 = 0, then 

( Z x - I lA II^Zi 
Z 2 - Z 24 // 44 ^ 

z 3 - j 34 // 44 z 4 

V 0 

where is the ij-th element of I~ 1 (9 0 ), Q = (£ ,iv,g,h) T and 0 0 = (£ 0 , uo, go, h 0 ) T . 

Theorem 1 essentially states that when ho > 0, if K n —> 00 fast enough as n —* 00, 9 ma ie,n has 
the same asymptotic distribution as O m i e , n and reaches the Cramer-Rao efficiency lower bound. 
This is confirmed by our simulation example 1 in Section 4. When h 0 = 0, asymptotically, O ma i e:n 
has a mixture normal distribution and there are 50% of times that h will be estimated as exactly 
0. In the Supplementary Material, we give a detailed description of how to compute U n (6) and 
I n {9) numerically and thus approximate the information matrix I(Oo) by I n (9 m aie, n )■ 

Based on Theorem 1, we can see that it is critical to determine whether ho = 0 or not because 
the limiting distributions of 9 ma i e ^ n under these two scenarios are radically different. So the next 
question becomes: can we perform a hypothesis test for Hq: h = 0? Fortunately, the answer is 
yes. Let C be a subset of the parameter space of 9. We define the approximated likelihood 
ratio test statistic for testing the null hypothesis H 0 : 9 G O 0 as 

D n = — 2{ sup L n {9) - sup L n (9)}. (9) 

Oefio 0eO 

In this paper, we consider three types of O 0 ’s: = lx( 0 , 00) x{ 0 } x [ 0 , 00), = Kx ( 0 , 00) x 

M x { 0 } and = Rx( 0 , 00) x { 0 } x { 0 }, which correspond to testing : g = 0 , : h = 0 , 

i^o : h = g = 0 , respectively. The following theorem gives the limiting distributions of D n under 
these three null hypotheses. The proof is given in the Appendix. 

Theorem 2. (Approximated Likelihood Ratio Tests, ALRT) Under Conditions 1-5 in the Ap¬ 
pendix, as n —» 00 and K n —> 00 , 
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1. Under : g = 0, one has D n —> D ~ Xi distribution; 

2. Under Hq 2 ' 1 : h = 0, one has D n —* P ~ 0.5xo + 0.5;\;i distribution; 

3. Under : g = h = 0, one has D n —> D ~ 0.5x 2 + 0.5x2 distribution; 

where x% is the chi-square distribution with df degrees of freedom and 0.5x1^ + 0.5xi/ 2 stands for 
a mixture distribution of Xdp and \% 2 w ith 50% of each component. In particular, Xo represents 
the distribution with a point mass at 0, i.e. P(X = 0) = 1 z/X ~ x o- 

The critical values of the mixed x 2 distribution can be found using the function qchibarsq in 
the R package “emdbook” (Bolker, 2008). From Theorem 2, it is interesting that D n for testing 
H ( (} ]> : g = 0 has the the same limiting distribution, whether or not the value of h 0 is 0. Under 
the null hypotheses Hq : h — 0 or H () : g = h = 0, the first component of the asymptotic 

distributions of D n comes from the cases when h ma i e ^ n = 0, which should happen, asymptotically 
50% of times by Theorem 1. 

3.2 Finite sample performance of ALRT tests 

Although by Theorem 1, when the sample size n —> oo, the occurrence rate of h being estimated 
as 0, denoted by P 0 n , should approach 0.5, our empirical studies show that P 0 ,n is almost always 
bigger than 0.5 when n is small. In our simulation studies, we noticed that, for a fixed n, Po,n 
is closely related to another quantity, U 0 , n = P(S n < 0), where S n is the fourth entry in the 
vector n~ l t 2 U n {6 o). As illustrated in Figure 3 (b), when the sample size n grows, the difference 
between Po, n and Co, n becomes smaller and smaller. An intuitive explanation is that if S n < 0, 
the approximated likelihood function L(6) cannot be improved by moving the parameter h away 
from 0 toward oo, given that the other parameters are fixed at (fo, uo, go), arid therefore h = 0 
should be the resulting solution. Although the explicit relationship between Po ir) , and Co, n remains 
unclear, studying the behavior of Uo, n will provide some insights on why P 0in —* 0.5 at such a 
slow rate. 

In the special case of g 0 = ho = 0, some tedious algebra yields that S n = —nr 1 / 2 J2i=i( z t /3 — 
zf) with zf s being independent samples from the iV(0,1) distribution. By the Central Limit 
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Theorem, S n converges in distribution to a normal distribution with mean 0 and thus C'o. n = 
P(S n < 0) — > 0.5 as n —> oo. However, even for a large n, the finite sample distribution of S n 
can still be severely skewed to the left due to the existence of the term zf in the summation. 
As a result, even for a large n, we can still have Co, n > 0.5. Our simulation studies show that 
this left-skewness in S n becomes even more sever when \g 0 \ deviates from 0. Because of the 
association between P 0n and C 0)Tl , this partially explains why P Qn is always greater than 0.5. 

Next, we use a small simulation study to illustrate this phenomenon. The data were generated 
using model (1) with £ 0 = cn 0 = 3, h 0 = 0 and g 0 = —0.5, —0.4,..., 0.4, 0.5. In Figure 3, for each 
n, we plot the values of Po,n and Co, n obtained using different values of go in a boxplot. We can 
see that Po, n is always greater than 0.5 regardless of the values of go and n and that it slowly 
regresses to 0.5 as n increases. In addition, as n increases, the difference Po, n — C 0 n consistently 
decreases for various values of g 0 , indicating strong association between P 0)n and Co, n - 
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Figure 3: (a) Percentage of times, P 0 ,n> when h ma i e , n — 0 with /i 0 = 0; (b) Difference between 
Po,n and Co, n - Po,n ~ Co ,n vs sample size n. 


Such a phenomenon indicates that, for a finite n, the critical values based on the asymptotic 

(2) (3) 

distributions in Theorem 2 are often too large for the null hypotheses and Hq . For example, 
to test H q , if the estimated value for h is h rna i e . n = 0 under the alternative hypothesis, the 
resulting approximate likelihood ratio test statistic D n = 0. And if h ma i e ,n = 0 occurs more 
than 50% of times, then the finite sample distribution of D n consists of more than 50% of 0’s 
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when H { 0 2) is true, which means we will fail to reject H^ more often than (1 — a) 100% of times 

using the (1 — a)th quantile of the 0.5^0 + 0.5yf distribution. Similar arguments apply to H q 3) 

too. This explains why the size of the approximated likelihood ratio tests for H q and Hq 

are often smaller than the nominal level in the simulation example 2. However, even with such 

( 2 ) 

conservativeness, the approximated likelihood ratio tests for Hq ' and H (i still display significant 
local powers. 

4 Simulation Studies 

In this section, we use some numerical examples to show the effectiveness of our method. The 
data were generated using model (1) with different values of 6 0 = (£ 0 , ojq, do, ho) T and five sample 
sizes n = 100,200,400,1000,2000. All summary statistics were based on 1000 repetitions and 
for each repetition, we used b n = 10 and K n = max(1000,n). 

4.1 Evaluation of the estimation accuracy 

In the simulation example 1, we mainly focused on comparing estimation accuracies for differ¬ 
ent methods described in Section 2. Three alternative approaches were considered: the letter- 
value-based method (LV), the quantile least square method (QLS) and the numerical maximum 
likelihood estimation (NMLE, Rayner & MacGillivray, 2002a). For the LV method, six quantiles 
were used: pi v = (0.005, 0.01, 0.025, 0.05, 0.10, 0.25) T . For the QLS method, 10 quantiles were 
used with p t = , i = 1,... , m and m = 10. For the NMLE method, the function uniroot 

form the R package was used to numerically solve y = £ + uJT g ^(z p ) for p. 

We set 0 0 = (£o, cao, go, ho) T = (3, 3, 0.5, 0.2) T and the empirical standard errors based on 
1000 simulation runs are summarized in Figure 4. The asymptotic standard error derived from 
Theorem 1, which is also the Cramer-Rao efficiency lower bound of the maximum likelihood 
estimator (MLE), is reported as well. We can see that the maximum approximated likelihood 
estimator (MALE) is uniformly more accurate than both LV and QLS methods for all sample 
sizes. In terms of estimation efficiencies, the NMLE and MALE methods are almost identical 
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and reach the efficiency lower bounds, which confirms the theoretical findings in Theorem 1. In 
addition, the LV method did a better job at estimating the parameters g and h than the QLS 
method did, while it seems to be the reverse for estimating £ and c o. Additional simulation results 
using different values of 6 0 can be found in the Supplementary Material, where the findings are 
all consistent with Figure 4. 
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Figure 4: The estimation efficiencies of four methods for Tukey’s r/-and-/i distribution; MLE: 
asymptotic standard error of the maximum likelihood estimator; MALE: maximum approxi¬ 
mated likelihood estimator; NMLE: numerical maximum likelihood estimator; QLS: quantile 
least square estimator; LV: letter-value-based estimator. 


Although the MALE and NMLE methods yield roughly the same estimation accuracy, the 


proposed MALE method is more computationally efficient. To show this, we record their CPU 


times (in seconds) from the above simulation study. The same initial values and optimization 
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routine were used for both methods. All simulation runs were carried out in the software R on a 


cluster of 120 commodity Linux machines using a total of 300 CPU cores, with each core running 
at approximately 2 GFLOPS, or 2 billion floating point operations per second. The results based 
on the 1000 simulation runs are summarized in Table 1, where the first two columns are average 
CPU times (in seconds) for each simulation run for both methods and the third column is the 
average of the ratios of CPU times for these two methods applied to the same dataset. From 
Table 1 we can see that, when the sample size grows to n = 2000, our method is on average 
over 400 times faster than the NMLE method. We also would like to point out that the MALE 
method converges much faster than the NMLE method for almost all simulation runs, which also 
contributes to the smaller computation time of the proposed method. 


Table 1: Comparisons of computational costs for MALE and NMLE 



Average CPLT time per run 

Average of CPU time ratio per run 

sample size n 

MALE 

NMLE 

NMLE/MALE 

100 

0.13 

10.86 

91.68 

200 

0.14 

22.60 

173.80 

400 

0.17 

47.15 

301.86 

1000 

0.29 

113.68 

445.58 

2000 

0.53 

229.01 

464.58 


4.2 Evaluation of approximated likelihood ratio tests 

In the simulation example 2, we study the empirical size and the local power of the proposed 
approximated likelihood ratio tests. The null and alternative hypotheses are 

1. H, <■> : g = 0 vs Hi : g = djsfn with (di, d 2 , d 3 ) = (0,1.5, 3). The data were generated using 
model (1) with (£ 0 , u 0 , g 0 , h Q ) = (3,3, d/y/n, 0.2); 

2. Hp : h — 0 vs H\ : h = d/pn with (d\, d 2 , ofa) = (0,0.5,1.0). The data were generated 
using model (1) with (£ 0 , u>o, g 0 , ho) = (3, 3, 0.5, d/y/n); 

3. Hp : g — h — 0 vs Hi : g — 3 d/y/n,h = d/pn with (di,d 2: d 3 ) = (0,0.5,1.0). The data 
were generated using model (1) with (£ 0 , ^o, go, h 0 ) — (3,3,3 d/pn,d/pn). 
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Table 2: Empirical powers of the approximated likelihood ratio tests in the simulation example 2. 




H 0 

<■ 9 = 

0 

Ho 

: h = 

0 

Ho'. 

g = h 

= 0 



Nominal levels 

Nominal levels 

Nominal levels 

d 

n 

10.0 

5.0 

1.0 

10.0 

5.0 

1.0 

10.0 

5.0 

1.0 

di 

100 

12.3 

6.3 

1.8 

4.0 

2.0 

0.4 

7.7 

4.5 

1.1 


200 

11.4 

5.7 

1.4 

4.5 

2.2 

0.1 

9.6 

4.7 

0.8 


400 

10.4 

5.0 

1.6 

5.8 

2.4 

0.5 

8.7 

4.0 

0.4 


1000 

9.8 

3.8 

0.5 

5.6 

2.7 

0.3 

8.9 

3.9 

0.5 


2000 

00 

bo 

4.1 

1.3 

7.1 

3.4 

0.8 

8.9 

4.6 

0.7 

C?2 

100 

30.2 

19.9 

7.0 

27.9 

20.6 

8.7 

56.4 

46.8 

30.0 


200 

27.9 

19.4 

5.8 

33.9 

25.4 

10.4 

58.9 

47.7 

29.2 


400 

28.0 

17.6 

6.3 

40.4 

29.0 

11.6 

60.5 

49.3 

30.3 


1000 

28.5 

19.0 

7.2 

47.8 

33.1 

15.6 

64.6 

53.2 

32.5 


2000 

28.1 

18.4 

7.9 

53.7 

38.8 

17.8 

65.0 

53.6 

30.7 

C?3 

100 

67.4 

55.9 

30.8 

58.0 

48.7 

30.3 

93.0 

88.6 

78.4 


200 

68.6 

57.3 

32.6 

68.2 

58.7 

41.7 

95.3 

93.3 

84.0 


400 

67.7 

57.9 

35.9 

79.5 

69.7 

49.8 

97.2 

95.2 

88.4 


1000 

70.9 

59.8 

35.2 

83.2 

74.8 

58.9 

98.5 

96.6 

91.7 


2000 

68.9 

58.0 

34.9 

89.5 

83.3 

63.9 

98.8 

97.5 

92.1 


To study the ability of the proposed tests in detecting small departures from the null hy¬ 
potheses, we deliberately set up alternative hypotheses as a function of y 7 n , which are commonly 
referred to as local alternatives in the literature. Studying the power of hypothesis tests under 
local alternatives is more accurate in characterizing asymptotic behavior of the test statistic and 
has been widely used, for example, see Zhang (2001) and Xu & Wang (2010). The empirical 
powers based on 1000 simulation runs are summarized in Table 2. For and H^\ we see 
that under the null hypotheses, the empirical sizes are quite close to the nominal sizes. However, 
under the null hypothesis H q , the empirical sizes are much smaller than the nominal level, even 
for relatively large n. In other words, in such a situation, the approximated likelihood ratio test 
tend to be conservative, which confirms our discussion in Section 3.2. However, all tests possess 
significant local powers in detecting very small deviations from the null hypotheses and these 
local powers increase when the departure from the null hypothesis grows from c^/a/^ to d^j\fn. 
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5 Air Pollution Data 


In this section, we apply the proposed method to two years of air pollution data obtained 
from Ledolter & Hogg (2009, Chapter 1, Section 1.4), which can be downloaded at the web¬ 
site http://www.biz.uiowa.edu/faculty/jlcdolter/AppliedStatistics/. In each of 1976 and 1977, 
64 weekday afternoon lead concentrations, in micrograms/m 3 , were collected near the San Diego 
Freeway in Los Angeles. The same datasets were also used by Rayner & MaeGillivray (2002a, 
2002b). In Figure 5, the histogram and the density plots show possible skewness and some 
outliers in both datasets. Following Rayner & MaeGillivray (2002b), we use Tukey’s g-and-h 
distribution to fit both datasets. 


1976 1977 



O | | | i i i i i o i i i 

2 4 6 8 10 12 14 16 5 10 15 

Lead concentration Lead concentration 

Figure 5: Histograms of lead concentrations in 1976 and 1977. 

For parameter estimation, the four methods described in Section 4.1 were used. All estimates 
are summarized in Table 3. As we can see, the estimates 6 ma ie,n an d 0 nrn i e , n are identical. This 
is not surprising because both of these methods should yield estimators that are very close to 
the true maximum likelihood estimator. Although 0i Vj7l and 9 q i sn yield consistent estimates of 
£, u> and g, compared to O ma ie,n and O nm i e ,m they produce very different estimates for h. For 
example, for 1977, 9 q i s . n gives h q i Sin = 0.49, which is very questionable because with this value 
the fitted distribution basically has an infinite variance. 
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We also ran a series of hypothesis tests using the approximated likelihood ratio test statistics 
and the results are summarized in Table 3, where C 0.95 stands for the critical value derived from 
the asymptotic distribution given in Theorem 2 for each test at the 0.05 significance level. Based 
on these tests, there is significant evidence that h 0 is not 0 but it seems reasonable to set g — 0 for 
both datasets. This indicates that for both datasets, the data do not have a normal distribution 
but a heavy-tailed symmetric distribution. 


Table 3: Estimates and hypothesis tests of lead concentration data. 



Methods 

Year 1976 

Year 1977 


* 7 male,n 

(7.11,1.60,0.19,0.12) 

(9.38,1.22,0.04,0.34) 

Estimation 

@nmle,n 

(7.11,1.60,0.19,0.12) 

(9.38,1.22,0.04,0.34) 

(€,w,g,h) 

@lv,n 

(6.95,1.62,0.25,0.07) 

(9.40,1.33,0.04,0.21) 


@qls,n 

(7.12,1.62,0.24,0.01) 

(9.41,1.08,0.01,0.49) 

Hypothesis 
Tests 
{Dm C 0 . 95 ) 

H 0 : g = h = 0 
Hq\ h — 0 

H 0 - g = 0 

(7.26, 5.14), Reject; 

(4.46, 2.71), Reject; 
(1.54,3.84), Fail to reject. 

(14.34,5.14), Reject; 
(14.05,2.71), Reject; 
(0.03,3.84), Fail to reject. 


6 Conclusion 

In this paper, we proposed a computationally efficient estimation method for Tukey’s g-and-h 
distribution. The proposed maximum approximated likelihood estimator has the same limiting 
distribution as the true maximum likelihood estimator and the resulting approximated likelihood 
ratio tests can be used to test interesting hypotheses involving the two shape parameters g and h. 
The performances of the proposed method have been demonstrated through extensive simulation 
studies and an application to a air pollution data. 

There are several ways to extend the current results. Tukey’s g-and-h distribution can be 
used as the residual process of a regression model to conduct robust model selection and model 
averaging (Xu et ah, 2014). Another interesting application is to conduct model selection for 
an autoregressive model with an infinite variance as in Xu et al. (2012), where Tukey’s g-and-h 
distribution can be used to replace the stable distribution as the innovation process. It would 
also be interesting to conduct some theoretical investigations on the local powers of the ALRT 
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tests in Section 3 and how to construct valid confidence intervals for g and h, without knowing 
the true value h 0 = 0 or not. 


Appendix: Technical Proofs 

We first define some functions Ci(z,B) = \z p =z, C 2 (z,0) = \ Zp =z, C 3 (z,0 ) 


ddd^zp i z p= 


T g,hO d<pg(z p ) dz p i ^ / n \ _ 1 9 lo g T g,U z ) dyg(z p ) i ^ / m _ dy e (z p ) d 2 z p i w .• . 

r 9 ,h( 2 ) 9+2 p 90 ' z p =z ' > 00 17 1 0t Zp Up=z- vve USL 

some technical conditions needed for proofs of Theorem 1-2, where Conditions 2-3 are taken 
from Self & Liang (1987). 

Condition 1: There exists an open neighborhood of 0 O , denoted by @o,n Q M 4 , such that 
Y< y m \n < W < Y K n ,o with probability 1 for all 0 e @ 0 

Condition 2: The first three derivatives of L n {0) with respect to 0 exist with probability 1 on 

©o,n H 0; 

Condition 3: The third derivative of n~ 1 L n (0) is bounded by some function if n ( Y) with 

\E{^n(Y)}\ < 

Condition 4- 1(0) = E{I n (0)} is positive definite on @ 0 , n and 1(0 o) = var \n Y 2 f/ n (0o)}; 
Condition 5: AgA. sup e^ n m,ze[-b n ,b n ] I C j(z, 0)\ ->■ 0 for j = 1,..., 5 as n ->■ oo and K n ->■ oo. 

Lemma A.l. Under Conditions 1-5, as K n —>■ oo and n —>• oo, we /lave that 


4= sup |L n (0)-L n (0)| = o p (l), (A.l) 

V n 0600 ,nOO 

4= SU P |^nW-t/nW| = O p (l). (A.2) 

V n 060o,„nQ 

Proof. For a fixed 0 e @o,n H Q, let z p = r“^(^) and suppose < z p < Zk+i for some 
1 < k < K n . Then we can define the corresponding z p ^ as in (6) such that | z p — z p ^\ < 2 b n /K n , 
which implies that 

We(zp) -<Pe(z p )\ < ^ sup ^~(z p ) . 

^n z p e[-b n ,b„] O *-p 
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Then, it is straightforward to show that, under Conditions 1 and 5, we have 


1 ^ A 

—j= sup \L n (d) - L n (d)\ < sup — / =y^\if 0 (z Pi )-(f> 0 {z Pi )\ 

V n 0e© o ,nnn 0£©o,n,no v ^ 

< 2 ^ bn sup \Ci(z p , 0)\ = o p (l). 

0&@o,nCiQ,Zp£[-b n ,b n ] 


By definition, we can rewrite (6) as 


z p,k ~^-k 


y~Yk,e 
Yfc+1,0 — g 


(Zfc+l Zfc) ~^-k 


T g,h{z p ) r gi fe(Zfc) 
Fj,/i(^fc+l) 7~g,h(^-k) 


(Zfc+l — Z fc ) 


Applying the chain rule of differentiation, we have 


M^p) = 


d(pe(z p ) _ d(po(z p ) dipo(zp)dz p 


We + We = Ve{Zp) + 


where we denote v 0 (z p ) = °% ( 0 P> , w 0 (z p ) = and x 0 (z p ) = Similarly, we have 

a _ _ ^e(^P,fc) | fy>o(5p,fc)d5p,fc _ ,, /- W / x 

M0(- P ) dQ gw + g] ^ k ge v e {z ptk )+w 0 {zp, k )x 0 {zp), 

where we denote x 0 (z p ) = Straightforward algebra yields 

U 0 {z p ) - Uo(Zp) = Ug(Zp) - Ue{Zp )k ) + W 0 (z p , k ) {x 0 (z p , k ) - X 0 (z p )} 

d^x) 0 _ _ _ (A.3) 

<90<9t~ Z P,k) A W 0 \Z pk ) \X 0 (yZp tk ) X 0 (^p)} , 

where z* j e [Z^,Zfc + i). Furthermore, we can show that 


£ 0 (z p ) - X 0 (Zp, k ) = X 0 (z p ) - x 0 (z p ) + x 0 (z p ) - x 0 (zp )k ) = - x 0 (z p ) + x e (^p) - x 0 {z Pik ) 

= + W “ Xe ^ + “ X 0 (Wk) 

/ Tg,h(Zp) -j , 1 9^p dZp^ k . . (~ \ 

~ \T,, h (z w ) ~ M ~ k) ~ j M + W + Ie(2 * ,) “ 

Qt' 

_ T g,h( Z p,3)( Z P ~ Z p,2) dz p -^t( z p,4)( z p,5 ~ Z tfi) ^ z p , * w 


(A.4) 




V-p,4/V-p, 0 -p,Q/ , \(y \ 

T' lJt (z; :2 )K n + aesiz„ {z ^ ,{z - " r * h 


where Z*, < zA < Z/,. +] with 2 < i < 7. Then, combining (A.3) and (A.4), under Condition 5 


22 



and the fact that Z k+1 — Z k = 2 b n /K n , we have 


sup \fn\ue(z p ) - u e (z p )\ = o p ( 1), 

eee 0 nn,z p e[-b n ,b n ] 


which further leads to, under Conditions 1 and 5, that 


1 ^ ^ 

—= sup \U n (9) — U n (0)\ < sup —=y^\u e (z Pi )-u e ( 
V n 6»ee 0 no 6»G© 0 ,„no V n 


Z Vi> 


< sup y/n\u 0 (z p ) —ue(z p )\ = o p (l). 

0G©o, n nO,5;pe[— b„,b n \ 


□ 


Before proceeding to proofs of Theorem 1-2, we first define quantities ( n = n L I 1 (9o)U n (0o) 
and H n (0) = {Cn - (0 - 0o)} T /(0o){Cn - (e - 0o)} + n- 2 C^(0o) r /- 1 (0o)^(0o). 

Proof of Theorem 1. Under Conditions 2-4, by Lemma 1 in Self & Liang (1987), for any 0 such 
that \/n(9 — 9 0 ) = 0(1), we have 

L n {6 ) - L n (6» 0 )} = 77 n (6>) + R n (9), (A.5) 

where R n {9) = O. p (l)\\0 — 0 0 11 3 • 

Following the same arguments in the proof of Theorem 1 in Self & Liang (1987) and using 
the equation (A.l) in Lemma A.l, we can show that there exists a sequence 0 ma ; e ,n £ such 
that | O ma i et1l — 9 0 \ = O p (l/ \/n). Denote by 6 h„ the maximizer of the quadratic function H n {9). 
Because of the convexity of D, it is straightforward to show that 1 0 Hn — 9 0 \ = O p (1/a/u) and 
hence |0. m aZe,n — &H n \ = O p (l/y/n). Therefore, we must have that 

2 _ _ 

\R n \9 ma i en ) L n (Qfi )} 
n 

2 ~ - ~ ~ 2 - 2 - ~ 

= ~\L n \9 ma le,n) Z/ n (0// n )} ~{L n \0m a le, n ) L n {0 T ~{L n {0 ma if) L n (0//)} 

= —(A n (0 ma ; en ) — Lr)(0maZe,n)}-{-^n(^i/,J “ L n {9 H )} H-{ Animate,n) — L n (0 H )} (^-6) 

n n n 

= ~{U n (dl ) - U n (0()}(0 ma /e,n - + H n {6 ma l e , n ) ~ H n {6 Hn ) + R n (9 male, n) ~ Rn(9 Hn ) 

Hn{9male,n) Rn ( 9 ) T O p (l/'u), 

where 0^ G (9 rna [ en . 9n n ). The last equation follows by combining (A.5) and (A.2) in Lemma A.l. 
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Comparing the left-hand side and right-hand side of (A.6), by definitions of 9 ma i en and 
9 Hn , we must have L n (0 ma /e,n) - L n (0 Hr J > 0 and H n (d ma i e , n ) - H n (9 Hn ) < 0, which implies 
that \H n (6 malen ) — H n (0 Hn )\ = o p (l/n). Furthermore, since H, n {9) is a quadratic function, we 
conclude that \0 ma ie, n - 9 Hn \ = o p (l/y/n). 

The rest of the proof follows from Theorem 2 in Self & Liang (1987) and its application to 
cases 1 and 2 in that paper. □ 


Proof of Theorem 2. Denote 9 Q malen = argsup g e n o L n (0) and 0 ma / e ,n = argsup 0eQ L n {0). Simi¬ 
larly, define the true maximum likelihood estimator as 9® nlen = argsup 0eQo L n {9) and 9 m i e , n = 
argsup 0e Q L n {9). Then it is easy to show that 

Dn = —2 {L n (9^ nalen ) — L n (9 ma i ep f)} 

= — 2 {L n {9 maien ) — L n f9 ma i e , n )} + 2 {L n {9 malen ) — L n (9 ma i e , n )} 


-2 {L n {9° malen ) — L n (9 ma i e)Jl )} + 2{L n (9^ nlen ) — L n (9 m i e>n )} 


(A.7) 


ii 


-2{L n (0® xle ) — L n {9 m i^ n )}. 


D* 


Linder the null hypothesis : 9 G Do, we have shown in the proof of Theorem 1 that 

male,n ^0 I — \A^) an( ^ I @male,n 9 q \ — O p ( 1/\/u). Hence, | @male,n ~ @male,n \ ~ O p {l/yfn). 

Using (A.2) in Lemma A.l, we can show that 

I = ~2 {L n (9 ma ien) — L n {9 ma i en )f + 2 {LniGmal^n) ~ U n (0maZe,n) } 

= — 2{C/ n (0 1 ) — C7 n (0 1 )}(0 ma/en — 0maie,n) = °p(l), 

where 0 X G {9 ma i e n , 9 ma i epi ). 

Lemma 1 in Self & Liang (1987) showed that, under the null hypothesis, H 0 : 9 G D 0 , one 
has 9f lle n — 9° h | = o p (l/A/n) and 1 9 m i epi — 9 Hn | = o p (l/ \/n). Following the proof of Theorem 1, 
we can also show that 1 9® nalen — 9° Hn \ = o p (l/\/n) and 9 rna i e . n — 9n n \ = o p {l|^/n). Therefore 
| Omale,n ~ # mle,n\ = °pi l /V™) and \ d male,n ~ 9 m le,n\ = O p {l/yffi). Then, it is easy to sllOW that 


7 / — —2{L n {9^ ia i en ) — L n (0® nl en )} + 2{L n (9, ma i e ^ n ) — L n (9 m i^ n )} 

= — 2 U n (0 2 )(9male,n ~~ @mle,n) + 2U n (0 3 ) (0 m ale,n ~ 9 m i etn ) = O p (l), 
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where d 2 G {9° ma i e ,rv Om le ,n) ancl #3 e (Omaie,n, & mie,n) and U°(d* 2 ) is the partial derivative function 
of L n {6) in the restricted parameter space fV The last equation follows from the fact l/\/nU®{6) 
and y/nU n (6 ) are bounded for 6 in a “small” neighborhood of 0^ llen and 0 rn i en , respectively. 

By plugging / and II back into equation (A.5), we have D n = D* l + o p ( 1), which implies that 
D n and D* have the same asymptotic distribution. The distribution of D * was derived in Self 
& Liang (1987). The rest of the proof follows from Theorem 3 in Self & Liang (1987) as well as 
its application to cases 4-6 in that paper. □ 
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